Introduction:

Reporting of patient outcomes by CAR-T treatment centers is an accreditation requirement under the FACT (Foundation for the Accreditation of Cellular Therapy) Immune Effector Cell (IEC) Program standard. Currently, this reporting is a time-consuming manual process that is increasingly difficult to sustain as CAR-T applications and patient volumes grow. The objective of this study is to evaluate how a Large Language Model (LLM) can be used to automate the abstraction and summarization of medical records for CAR-T toxicities.

Methods:

Clinical data from patients treated with CAR-T between January 2018 and June 2025, including hematology clinical notes, lab results, flowsheets, vital signs, and medication administration records, were used as the source of patient information. The American Society for Transplantation and Cellular Therapy (ASTCT) CAR-T toxicity grading guidelines for cytokine release syndrome (CRS) and immune effector cell-associated neurotoxicity syndrome (ICANS) were used to define these toxicities.

Taking advantage of the large context window of the LLM (Gemini 2.5 Pro), we created a prompt that embedded both the patient data and the ASTCT guidelines for CRS and ICANS, along with an outline of the specific information we aimed to abstract from the data. Additionally, the LLM was instructed to provide contextual justification for each item it extracted from the patient information.
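The prompt structure described above can be illustrated with a minimal sketch. The function name `build_prompt`, the section labels, and the placeholder inputs are assumptions made for illustration; the actual prompt wording used in the study is not given in this abstract, and no model call is shown.

```python
from typing import Sequence

# Hypothetical field list mirroring the data elements named in this abstract.
FIELDS = (
    "event onset date",
    "resolution date",
    "maximum grade",
    "date of maximum grade",
    "medications given for management",
)


def build_prompt(guideline_text: str, patient_record: str,
                 fields: Sequence[str] = FIELDS) -> str:
    """Assemble one long-context prompt: guidelines, record, then task outline."""
    outline = "\n".join(f"- {name}" for name in fields)
    return (
        "ASTCT GRADING GUIDELINES\n"
        f"{guideline_text}\n\n"
        "PATIENT RECORD\n"
        f"{patient_record}\n\n"
        "TASK\n"
        "For CRS and for ICANS, report each item below and quote the "
        "record text that justifies it:\n"
        f"{outline}\n"
    )


# Placeholder inputs; real inputs would be the guideline text and patient data.
prompt = build_prompt("<guideline text>", "<notes, labs, vitals, medications>")
```

Placing the guidelines and the record ahead of the task outline is only one possible ordering; the abstract does not state which ordering was used.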

The LLM-generated output included event onset date, resolution date, maximum grade, date of maximum grade, and medications given for management. The CRS and ICANS outcome reports generated by the LLM were then compared to the IEC compliance program data. Any discrepancies identified were reviewed manually.
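A field-by-field comparison of this kind might look like the following sketch. The `ToxicitySummary` record and the `discrepancies` helper are hypothetical names, the example values are invented for illustration, and exact equality is assumed as the matching rule, which may differ from the criteria applied in the study.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class ToxicitySummary:
    """One CRS or ICANS summary for one patient (hypothetical schema)."""
    onset_date: Optional[date]
    resolution_date: Optional[date]
    max_grade: int  # 0 is used here to mean "no event"
    max_grade_date: Optional[date]
    medications: FrozenSet[str]


def discrepancies(llm: ToxicitySummary, reference: ToxicitySummary) -> List[str]:
    """Return the names of fields where the two summaries disagree."""
    return [f.name for f in fields(llm)
            if getattr(llm, f.name) != getattr(reference, f.name)]


# Invented example: the two sources differ only in the resolution date.
llm_out = ToxicitySummary(date(2025, 3, 4), date(2025, 3, 8), 2,
                          date(2025, 3, 5), frozenset({"tocilizumab"}))
reference = ToxicitySummary(date(2025, 3, 4), date(2025, 3, 9), 2,
                            date(2025, 3, 5), frozenset({"tocilizumab"}))
flagged = discrepancies(llm_out, reference)  # fields to send for manual review
```

Fields returned by the helper are the ones that would be routed to manual chart review.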

Results:

Among patients (pts) treated with CAR-T in the specified timeframe, 47 were selected as the training cohort. This cohort included pts with toxicity events, pts with no events, and pts with events of varying grades. For the test cohort, 66 pts who received CAR-T from March to June 2025 were included. Demographics for the training and test cohorts reflect evolving clinical practice (training, test cohorts: median age 69, 65; male 55%, 74%; lymphoma/leukemia 66%, 48%; multiple myeloma 34%, 52%; CD19 CAR-T 46%, 50%).

Compared to data from the IEC compliance program, the LLM-generated summaries for CRS events demonstrated an accuracy of 90%, precision 95%, sensitivity 100%, specificity 80%, F1 score 97%, and Matthews correlation coefficient 87%. For ICANS events, the LLM achieved an accuracy of 91%, precision 92%, sensitivity 92%, specificity 90%, F1 score 92%, and Matthews correlation coefficient 83%. Maximum grade, for both CRS and ICANS, was identified with >90% accuracy. However, dates for CRS and ICANS were less accurate, with the lowest accuracy noted in identifying the event resolution date: 45% for CRS and 67% for ICANS.
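For reference, the event-level metrics reported here follow from a 2x2 confusion matrix in which the IEC compliance program data serve as the reference standard. The sketch below uses the standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not the study's data.

```python
import math
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard classification metrics from confusion-matrix counts."""
    precision = _ratio(tp, tp + fp)
    sensitivity = _ratio(tp, tp + fn)  # also called recall
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": _ratio(tn, tn + fp),
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=8, fp=2, fn=0, tn=10)
```

Because the F1 score is the harmonic mean of precision and sensitivity, a sensitivity of 100% combined with a lower precision still yields an F1 score below the sensitivity.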

Medical records were reviewed for all discrepancies between the LLM output and the IEC compliance program data. The most common discrepancy was a mismatch in event resolution date, which was determined to result from differing criteria used by the compliance team and the LLM prompt. Of note, a few discrepancies were attributed to information missed by the IEC compliance team that was correctly identified and categorized by the LLM.

An updated LLM prompt (LLMv2) was developed and applied to the test cohort. In version 2, the prompt was revised to include clearer instructions on how to summarize CRS and ICANS events, as well as new instructions on additional data elements to be summarized. The performance of LLMv2 on this cohort for CRS event identification was 100% across all categories (accuracy, precision, sensitivity, specificity, F1 score, and Matthews correlation coefficient). For ICANS, LLMv2 achieved an accuracy of 96%, precision 69%, sensitivity 100%, specificity 93%, F1 score 82%, and Matthews correlation coefficient 80%. Accuracy in identifying event resolution date improved significantly from 64% to 99%.

Conclusions:

Our study demonstrates that LLMs can significantly reduce the time required to abstract and summarize CAR-T toxicities, while maintaining high accuracy. One limitation of our study is that it is based on data from a single institution. However, plans are already underway to evaluate the use of LLMs in a multi-institutional setting.
